Boosting ensembles with controlled emphasis intensity
Boosting ensembles have received much attention because of their high performance. However, they are also sensitive to adverse conditions, such as noisy environments or the presence of outliers. One way to fight this degradation is to modify the form of the emphasis weighting that is applied to train each new learner. In this paper, we propose a general form for that emphasis function which includes not only an error-dependent term and a term that depends on the proximity to the classification boundary, but also a constant value that serves to control how much emphasis is applied. Two convex combinations are used to combine these terms, which makes it possible to control their relative influence. Experimental results support the effectiveness of this general form of boosting emphasis.
This work has been partly supported by research grants CASI-CAM-CM (S2013/ICE-2845, DGUI-CM) and Macro-ADOBE (TEC2015-67719-P, MINECO).
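As a rough illustration of the kind of emphasis function described in this abstract, the following sketch combines an error term, a boundary-proximity term, and a constant through two convex combinations. The exponential forms, the mixing parameters `alpha` and `lam`, and the function name are assumptions for illustration, not the paper's actual notation:

```python
import numpy as np

def mixed_emphasis(f, t, alpha=0.3, lam=0.5):
    """Hypothetical emphasis weighting: two convex combinations mix an
    error-dependent term, a boundary-proximity term, and a constant.
    f: real-valued ensemble outputs; t: +/-1 targets."""
    err = np.exp((f - t) ** 2)       # grows with classification error
    prox = np.exp(-f ** 2)           # peaks near the decision boundary f = 0
    mixed = lam * err + (1 - lam) * prox            # first convex combination
    w = alpha * np.ones_like(f) + (1 - alpha) * mixed  # second: constant term
    return w / w.sum()               # normalize to a sampling distribution
```

Setting `alpha = 1` removes the emphasis entirely (uniform weights), while `alpha = 0` recovers a fully mixed emphasis; intermediate values graduate its intensity, which is the control the abstract refers to.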
Pre-emphasizing Binarized Ensembles to Improve Classification Performance
14th International Work-Conference on Artificial Neural Networks, IWANN 2017
Machine ensembles are learning architectures that offer high expressive capacity and, consequently, remarkable performance, owing to their high number of trainable parameters. In this paper, we explore and discuss whether binarization techniques are effective at improving standard diversification methods, and whether a simple additional trick, weighting the training examples, yields better results. Experimental results for three selected classification problems show that binarization enables standard direct diversification methods (bagging, in particular) to achieve better results, with even more significant performance improvements when the training samples are pre-emphasized. Some research avenues opened by this finding are mentioned in the conclusions.
This work has been partly supported by research grants CASI-CAM-CM (S2013/ICE-2845, DGUI-CM and FEDER) and Macro-ADOBE (TEC2015-67719-P, MINECO).
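A minimal sketch of the two ingredients this abstract combines: a binarization step (here one-vs-rest, one possible choice) and bootstrap bags drawn with probabilities proportional to pre-emphasis weights. The function names and the specific weighting scheme are illustrative assumptions; uniform weights recover plain bagging:

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize_labels(y, positive):
    """One-vs-rest binarization: +1 for the chosen class, -1 otherwise."""
    return np.where(y == positive, 1, -1)

def pre_emphasized_bags(y, weights, n_bags=11):
    """Bootstrap resampling with probabilities proportional to the
    pre-emphasis weights; returns one index array per bag."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    n = len(y)
    return [rng.choice(n, size=n, replace=True, p=p) for _ in range(n_bags)]
```

Each bag would then train one binary base learner; samples with larger pre-emphasis weights appear in more bags, so the ensemble concentrates on them.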
Enfatizado y diversificación en clasificación máquina (Emphasis and diversification in machine classification)
The exceptional capabilities of Boosting methods, especially the Real AdaBoost (RAB) algorithm, for solving decision and classification problems are universally recognized. This good performance stems from the progressive construction of a set of weak, unstable learners, combined linearly, that pay more attention to the samples that are hardest to classify. However, the corresponding emphasis that is applied can be inadequate, particularly under high noise levels or an abundant presence of out-of-margin samples ("outliers"). For these working scenarios, several modifications of the basic Boosting algorithm have been proposed to control the amount of emphasis applied, but none of them seems to deliver the expected results on imbalanced datasets, in the presence of outliers, or with asymmetric data distributions.
With this in mind, Chapter 2 first proposes a simple modification of the emphasis function of the standard RAB algorithm that takes into account not only the error of the sample to be classified, but also the classification errors of the samples closest to it. Chapter 3 then presents a generalization of the hybrid emphasis function used in versions of the RAB algorithm that weight the samples (through a mixing parameter) according to their classification error and their proximity to the boundary. This new emphasis function includes a constant term that serves to moderate the emphasis intensity, in other words, to limit the attention focused on the samples that are closest to the boundary or hardest to classify. The results obtained in Chapters 2 and 3 indicate that these modifications of the emphasis functions achieve better performance.
Subsequently, Chapter 4 proposes emphasizing the costs associated with the training samples to improve the classification results of ensembles based on standard diversification schemes and binarization. The results obtained in this chapter show how binarization techniques enable standard diversification methods (Bagging, in particular) to achieve better performance, with much more significant improvements when the training samples are emphasized beforehand.
This Doctoral Thesis concludes by enumerating its main contributions and suggesting open research lines.
The exceptional capabilities of Boosting methods, in particular of Real AdaBoost
(RAB) ensembles, for solving decision and classification problems are universally recognized. These capabilities come from progressively constructing weak, unstable learners that pay more attention to the samples that are harder to classify correctly, and combining them linearly. However, the corresponding emphasis can be inappropriate, in particular under intense noise or in the presence of outliers. For these scenarios, many modifications have been proposed to control the emphasis, but they show limited success on imbalanced or asymmetric problems. A simple way to deal with these situations is to modify the form of the emphasis weighting that is applied to train each new learner.
Firstly, in Chapter 2 we propose a simple modification of the well-known RAB emphasis function. The basic idea underlying this modification, which makes use of the neighborhood concept to reduce the above drawbacks, is to emphasize the samples according to their own errors and those of their neighbors. Next, in Chapter 3, we propose a general form of the emphasis which includes not only an error-dependent term and a term that depends on the proximity to the classification boundary, but also a constant value that serves to graduate the intensity of that mixed emphasis, limiting the increased attention paid to highly erroneous samples and to samples near the boundary. Experimental results obtained in both Chapter 2 and Chapter 3 support the effectiveness of these forms of Boosting emphasis.
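A minimal sketch of the neighborhood idea behind Chapter 2: each sample's emphasis mixes its own error with the mean error of its nearest neighbors, so an isolated erroneous sample (a likely outlier) is downweighted relative to a sample whose whole neighborhood is erroneous. The mixing parameter `mu`, the Euclidean k-NN choice, and the function name are illustrative assumptions:

```python
import numpy as np

def neighborhood_emphasis(X, errors, k=3, mu=0.5):
    """Hypothetical neighborhood-smoothed emphasis: convex combination of
    each sample's own error and the mean error of its k nearest neighbors."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude each sample from its own neighborhood
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbors
    w = mu * errors + (1 - mu) * errors[nn].mean(axis=1)
    return w / w.sum()                     # normalize to a distribution
```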
In Chapter 4, we propose weighting the costs associated with the training examples in order to improve the classification results of ensembles based on standard diversification and binarization techniques. Experimental results show that binarization enables standard direct diversification methods (Bagging, in particular) to achieve better results, with even more significant performance improvements when the training samples are pre-emphasized.
This Doctoral Thesis concludes by enumerating its main contributions and suggesting new research lines arising from this work.
Programa Oficial de Doctorado en Multimedia y Comunicaciones. Thesis committee: Chair: Luis Vergara Domínguez; Secretary: Francisco Javier González Serrano; Member: Alberto Suárez González.
Word Sense Induction in the Arabic Language: A Self-Term Expansion Based Approach
Abstract. The aim of the word sense induction/discrimination task in natural language processing is to discover the sense associated with each instance of a given ambiguous word. In this paper we present an approach based on clustering a self-expanded version of the original dataset in order to tackle this problem. The self-expansion technique substitutes every term of the original corpus with a set of co-related terms computed by means of pointwise mutual information. Our proposal, which was tested for the English language, shows good performance for the Arabic language too, highlighting its language-independent character.
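A rough sketch of the PMI step described above: count how often terms co-occur within a sentence, score pairs by pointwise mutual information, and keep each term's highest-scoring co-occurring terms as its expansion set. The windowing choice (whole sentences), the function name, and `top_k` are illustrative assumptions:

```python
import math
from collections import Counter
from itertools import combinations

def pmi_expansion_sets(sentences, top_k=2):
    """Hypothetical self-term expansion: map each term to the co-occurring
    terms with highest pointwise mutual information, PMI(a, b) =
    log(p(a, b) / (p(a) * p(b))), estimated over sentence co-occurrence."""
    term_counts, pair_counts = Counter(), Counter()
    for sent in sentences:
        terms = set(sent)                      # count each term once per sentence
        term_counts.update(terms)
        pair_counts.update(frozenset(p) for p in combinations(sorted(terms), 2))
    n = len(sentences)

    def pmi(a, b):
        return math.log((pair_counts[frozenset((a, b))] / n)
                        / ((term_counts[a] / n) * (term_counts[b] / n)))

    expansion = {}
    for t in term_counts:
        related = [(u, pmi(t, u)) for u in term_counts
                   if u != t and pair_counts[frozenset((t, u))] > 0]
        related.sort(key=lambda x: -x[1])      # strongest co-relations first
        expansion[t] = [u for u, _ in related[:top_k]]
    return expansion
```

The expanded corpus would then replace (or augment) each original term with its expansion set before clustering the instances of the ambiguous word.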